Data Visualization in R

[NAME]

About Myself

  • My research interest is [INSERT DESCRIPTION]
  • I use R in my research to:
    • organize data
    • develop machine learning algorithms
  • I mostly work [in this disease system or approach]
  • My contact info is [INSERT EMAIL]

Review of past workshops

You should be able to…

  • iterate a calculation
  • simulate populations
  • write a function
  • create an HTML doc with code and commentary
  • write code following best practices
  • calculate summary statistics of a dataset
  • create a figure from data

iteration

Outline

You should be able to…

  • customize figures
  • code a t-test
  • code a correlation
  • code a linear regression

Topics

  1. Approach to visualizations
  2. ggplot review
  3. Plotting challenge

    Break

  4. Inferential statistics

  5. Wrap Up

1. Approach to visualization

Visualizations are useful for data exploration and presentation.

Characteristics of data exploration visualizations

  • complex
  • minimally annotated
  • potentially pattern-less

should be avoided for presentation visualizations.

Presentation visualizations

should guide the reader through the research.

Visualization path

Presentation visualizations

We've spent a bit of time making figures for exploration, but haven't touched on presentation quality visualizations.

What are some characteristics of a good presentation visualization?

2. Review of ggplot

plots are divided into 3 fundamental parts

plot = data + Aesthetics + geometry

Minimium arguements

Using the ggplot2 cheat sheet

The cheat sheet reads like a choose your adventure book.

  1. Based on your data, go to the One, Two or Three variables section.

Using the ggplot2 cheat sheet

The cheat sheet reads like a choose your adventure book.

  1. Based on your data, go to the One, Two or Three variables section.
  2. If Two Variables decided if x and y are continuous.

Using the ggplot2 cheat sheet

The cheat sheet reads like a choose your adventure book.

  1. Based on your data, go to the One, Two or Three variables section.
  2. If Two Variables decided if x and y are continuous.
  3. Start you ggplot based on the example given under the heading

Using the ggplot2 cheat sheet

The cheat sheet reads like a choose your adventure book.

  1. Based on your data, go to the One, Two or Three variables section.
  2. If Two Variables decided if x and y are continuous.
  3. Start you ggplot based on the example given under the heading
  4. Add geom_function and additional aesthetic mapping as needed

Using the ggplot2 cheat sheet

The cheat sheet reads like a choose your adventure book.

  1. Based on your data, go to the One, Two or Three variables section.
  2. If Two Variables decided if x and y are continuous.
  3. Start you ggplot based on the example given under the heading
  4. Add geom_function and additional aesthetic mapping as needed
  5. Repeat last step until finished.

3. Plotting challenge

Often, you'll have an idea of the visualization you'd like to make.

But how to get there?

practice (and google)

3. Plotting challenge

For the visualization exercises you'll work with the people sitting in your row.

  • Open W4_Exercise\DataViz_X.html with the number corresponding to your row.
  • Complete the 2 challenges as a group. (this is a chance to practice starting a new project)
    • Every group has the same 1\( ^{st} \) challenge, but different 2\( ^{nd} \) challenge
    • When you're finished, practice explaining your Challenge 2 code.
  • We'll shuffle groups and explain the solutions to eachother.

If you finish early, add more details to the plots or try exporting as a .png.

3. Plotting challenge

What were some of the interesting things you learned?

Break

When you come back we'll be switching gears to inferential statistics

4. Inferential statistics

is the branch of statistics that deals with hypothesis tests.

R is much better for inferential statistics than other software

why?

Strength of R for statistics

R is much better for inferential statistics than other software

  • many of the common statistical analyses already exist as part of libraries
  • code based analysis means modifications to analysis are recorded

Practice using inferential statistics

We will use these tests to explore a dataset of measles cases collected from three districts in Niamey, Niger.

plot of chunk unnamed-chunk-1

Start by opening W4_Exercise\W4_Exercise.Rproj

We want to compare D1 and D2 to ask:


Question

  1. Are the average number of cases in the two districts of Niamey different from each other?
  2. Are the number of cases for each reporting period strongly associated?
  3. Does the estimated \( R_0 \) differ between districts?


Test

  1. T-test t.test()
  2. Correlation cor.test()
  3. Linear regression lm()

We want to compare D1 and D2 to ask:


Test

  1. T-test t.test()
  2. Correlation cor.test()
  3. Linear regression lm()


Result

  1. the means differ

We want to compare D1 and D2 to ask:


Test

  1. T-test t.test()
  2. Correlation cor.test()
  3. Linear regression lm()


Result

  1. the means differ
  2. strongly, positively correlated

We want to compare D1 and D2 to ask:


Test

  1. T-test t.test()
  2. Correlation cor.test()
  3. Linear regression lm()


Result

  1. the means differ
  2. strongly, positively correlated
  3. different slopes and \( \hat R_0 \)

5. Wrap Up: We can ...

  • calculate summary statistics of a dataset
  • create a figure from data
  • create an HTML doc with code and commentary
  • write code following best practices
  • write a function
  • iterate a calculation
  • simulate populations
  • customize figures
  • code basic inferential statistics

iteration